Correlating metasurface spectra with a generation-elimination framework

Inferring optical response from other correlated optical response is highly demanded for vast applications such as biological imaging, material analysis, and optical characterization. This is distinguished from widely-studied forward and inverse designs, as it is boiled down to another different category, namely, spectra-to-spectra design. Whereas forward and inverse designs have been substantially explored across various physical scenarios, the spectra-to-spectra design remains elusive and challenging as it involves intractable many-to-many correspondences. Here, we first dabble in this uncharted area and propose a generation-elimination framework that can self-orient to the best output candidate. Such a framework has a strong built-in stochastically sampling capability that automatically generate diverse nominations and eliminate inferior nominations. As an example, we study terahertz metasurfaces to correlate the reflection spectra from low to high frequencies, where the inaccessible spectra are precisely forecasted without consulting structural information, reaching an accuracy of 98.77%. Moreover, an innovative dimensionality reduction approach is executed to visualize the distribution of the abstract correlated spectra data encoded in latent spaces. These results provide explicable perspectives for deep learning to parse complex physical processes, rather than “brute-force” black box, and facilitate versatile applications involving cross-wavelength information correlation.

1. The complexity of the proposed machine learning method lacks justification. (e.g. Why can't a many-to-many problem be approached by two standard one-to-many networks?) A simple algorithm-independent formulation of the many-to-many problem at hand is missing.

Authors Responses 1:
We appreciate the referee for the constructive comment. Below we will specifically justify the complexity of our generation-elimination framework, beyond two standard one-to-many networks.
Foremost, we want to illustrate how we realize the many-to-many mapping between correlated optical responses. The trained generation and elimination networks provide two latent spaces that perform one-tomany mappings, and the decoders that can transform the latent variables into their respective spectra spaces (i.e., high-dimensional spectra data). To realize the fusion of two spectra spaces, a bifurcating tree is exploited to build up a two-way mutual selection between low-frequency and high-frequency spaces. As illustrated in Fig. R1, for a given input (low-frequency), the generation network can generate various candidates (highfrequency). But, the question is how to determine the optimal candidate. To solve this, the elimination network is used to reversely map each candidate into the original space. Based on the two-step bifurcating tree, the optimal candidate is picked out by calculating the Euclidean distance between the input and secondary R2 candidates. In this way, the two-way mutual selection and many-to-many mapping are realized without consulting structural information to evaluate the inferred candidates/spectra. Each standard one-to-many network has an inherent error because of the deviation of distributions and bias of sampling. It is worth mentioning the KL divergence loss term in Eq. 1 of the main text, which competes with the reconstruction loss until convergence and forces the true latent distribution to be the same as the standard Gaussian prior distribution. Although it affords the leverage to sample from standard Gaussian distribution during the inference phase and brings diverse nominations, it will inevitably cause errors in the spectral correlation because of the non-negligible deviation of sampling between the true latent distribution and standard Gaussian prior distributions. Thus, if we directly adopt two one-to-many networks, the error will be accumulated and magnified. As shown in Fig. R1, the inferior candidates can be excluded by comparing the distances, and the error will be eliminated to the greatest extent through the two-way mutual selection. In traditional inverse design based on one-to-many networks, the screening process that involves structure information is also indispensably needed, otherwise, they cannot evaluate the error and verify the rationality of the solution.
In conclusion, in our framework, we advantageously make full use of the conventional one-to-many network and compensate for its limitation when adapted to (meta-)materials design. The inherent error is delicately eliminated by the mutual selection of two spectra spaces through a bifurcating tree. The idea is not duplicated from but in-turn facilitates the studies in computer science and the framework could be readily extended to other research domains of photonic design.
In the new submission, we have added the above discussion, On lines 63-66, "Besides, the inherent bias in conventional one-to-many network cannot be evaluated and mitigated in virtue of the structural information anymore. The direct connection of two standard one-to-many networks will cause the magnification of the error." On lines 73-75, "the operation process is understandable as a hierarchical bifurcating tree, which builds up a two-way mutual selection between two spaces and eliminates the inherent error to the greastest extent." On lines 220-222, "Second, the KL divergence term in Eq. 1 reveals that there is a deviation between the true latent distribution and standard Gaussian prior distribution. It means that the latent variables, though sampled from the acceptable region, are not guaranteed to retrieve highly precise nominations." On lines 234-237, "With the delicately designed bifurcating tree, the bidirectional mapping and two-way selection are built up between low-frequency and high-frequency spaces, and the inferior candidates can be excluded by comparing the Euclidean distance between the input and secondary candidates in the lowfrequency space."

R3
Fig. R1 | The two-way mutual selection between low-frequency and high-frequency spaces. The trained generation and elimination networks offer two latent spaces that perform unidirectional one-to-many mappings between two spectra spaces. Based on the two-step bifurcating tree, the two-way mutual selection is realized and the deviation is mitigated to the greatest extent. Dist. is the abbreviation of distance and cand.
is the abbreviation of candidata.

Referee #1 --Comment 2:
2. There is no systematic quantified evaluation on some representative set of meta-materials or optical responses. Only anecdotal visual evidence is provided for the desired behavior of the proposed ML method.
There is insufficient comparisons to simpler ML baselines.

Authors Response 2:
Thanks for the good suggestion. As a complement to the visual evidence, we use three criteria to systematically quantify the performance of our framework. Meanwhile, we take the fully-connected network as a simpler ML baseline model and evaluate it with the same criteria to highlight the superiority of our framework.
(1) MSE: the mean square error between the predicted spectra and ground truth spectra.
(2) Average accuracy (1 − ℯ !"# ) × 100%, where ℯ !"# is defined as the average relative error between the predicted spectra and ground truth spectra, that is, represents the th data point of the ground truth (predicted) spectra, and is the number of spectral points.
(3) Similarity: the correlation coefficient between two vectors/curves, defined as, where 9 ( ' : ) represents the mean of the ground truth (predicted) spectra over all points & ( & ' ). order of magnitude larger than our framework, indicating the failure of convergence and the invalidity of tackling the non-uniqueness predicament. Besides, the average accuracy and the similarity give more intuitive comparisons between the predicted spectra curves and the ground truth. Both criteria exhibit much higher accuracies when our framework is adopted.

R4
Furthermore, we have also shown the quantified summary statistics on another two geometry patterns (arc and distorted h-shape); see Referee #2 --Comment 3 in Table R2 on page R8. Both the low MSE errors and high accuracies in all geometry cases further exemplify the generality and versatility of our framework.
In the new submission, we have incorporated Table R1 as Fig. 5c and added the above description, On lines 19-20, "where the inaccessible spectra are precisely forecasted without consulting structural information, reaching an accuracy of 98.77%." On lines 241-255, "To quantify the outstanding performance of our framework, … generality and versatility of the proposed framework." On lines 469-470, "The quantitative comparisons of three criteria on the elliptical dataset when trained with the baseline FCN model and our framework, separately." 3. There is no clear worked through example of how this method can be concretely applied in practice.

Authors Response 3:
Thanks for the good question. In this work, we proposed a generation-elimination framework for the spectral correlation, where the high-frequency response of the metasurfaces is inferred from the low-frequency response with an accuracy of 98.77%. The major advantage is to address many-to-many mapping in deep learning. Such advantage brings many benefits. First, largely reducing the meshed design space and simulation time by only utilizing low-frequency data. Second, extracting the desired images, spectra, and material features from other easily accessible information at a low cost [Appl. Sci. Res. 10, 5790-5795 (2020)].
Apart from its application in metasurfaces, the spectral correlation can be generalized in various fields. To be specific, we use synthetic aperture radar (SAR) to show how this method can be concretely applied in practice, as shown in Fig. R2. Typically, the SAR system generates copious amounts of complex raw data that require intense computation to produce high-resolution images. In some real-time applications, such as navigation decisions, interplanetary missions, and unmanned aerial vehicles, the resolution must be sacrificed for on-thefly imaging processing, resulting in fuzzy images like the low-resolution rabbit picture in Fig. R2. However, by using our spectral correlation technique, the low-resolution rabbit picture obtained from the SAR system is possible to be transitioned to the high-resolution one with negligible processing time [Proc. SPIE 8714, 871416 (2013)]. For more applications, such as in communications, the noise in the high-frequency band could be wiped off in conjunction with the low-frequency correlation [ICASSP 636-640 (2018)]. Also, as pointed out by Referee #3, "[the problem] is potentially interesting in various research areas where extrapolation of data to higher or lower frequency ranges is often needed, for example in the application of Kramers-Kronig".
In the new submission, we have added the above discussion, On lines 55-57, "This is instrumental in a wide range of applications, for example, to extract the desired images, spectra, and material features from other easily accessible information with the advantages of low cost and easy measurement 25 ." On lines 88-90, "For example, in communication and Raman spectroscopy, in conjunction with the lowfrequency correlation, the noise in the high-frequency region can be largely wiped off [39][40] ." On lines 92-94, "And as a more concrete example, the low-resolution images obtained from synthetic aperture radar (SAR) system could be transitioned to the high-definition one with negligible processing time for some real-time applications 42 ." the low-resolution rabbit image obtained by low-frequency SAR detection is possible to be converted into a high-resolution image.

Referee #1 --Comment 4:
Overall, I believe that manuscript does not meet the quality standards for publication in this journal.

Authors Response 4:
We apologize for the possible confusion on the practicality and technical issues, and sincerely appreciate the referee for all the constructive comments as they help strengthen the idea presented in the paper. In this work, we for the first time dabble in this many-to-many mapping dilemma and propose a groudbreaking and effective method to cope with it. From the methodology, the inherent error in conventional one-to-many network is delicately eliminated by the mutual selection of two spectra spaces. From the application viewpoint, we conceptualize the third-class metasurface design where the correlated metasurface spectra are precisely forecasted, and we believe it will bring widespread and far-reaching implications to various research domains where extrapolation of spectra data is needed or intractable many-to-many correspondences are involved. We anticipate the response and the updated manuscript could remove the referee's confusion. The authors report a generation-elimination framework and apply it to metasurface design. In particular, a spectra-to-spectra design idea is proposed. To distinguish it from forward and inverse designs, they term it the third-class metasurface design. This idea is innovative and this work is suitable for publication in Nature Communications. I suggest accept with minor revisions. Just some minor comments here.

Authors Response:
We thank the referee for his/her positive comments and for identifying the novelties of our work. In the following, we will address the specific questions point-by-point.

I would like to suggest the authors compare their works with other related literatures to further benchmark
the superiority of this work.

Authors Response 1:
Thanks for the good suggestion. We have followed the referee's suggestion and compared our work with other On lines 58-63, "Achieving third-class metasurface design faces a formidable challenge because it involves complex many-to-many mapping, rather than simply one-to-one and one-to-many mapping 26,27 … In doing so, conventional deep learning algorithms will become conflicted on how to adjust learnable weights and the convergence can not be guaranteed 28,29 ." Referee #2 --Comment 2: 2. The establishment of latent space requires further elaboration. Although the latent distribution has been explained in this paper, I think it is still necessary to supplement the rationality of the selection of Gaussian function.

Authors Response 2:
We appreciate the referee for pointing this out. Actually, the normal distribution is not the only option that can be employed for latent variables in VAEs. There are many other options, such as von Mises-Fisher However, the normal distribution is still the most widely used in subsequent studies because it has many nice properties, such as analytical evaluation of the KL divergence in the variational loss (Eq. S9-S11 in the supplementary materials), and efficient gradient computation of the reparameterization trick. Moreover, one of the apparent advantages of VAE is that it allows the generation of new samples by sampling in the latent space-which would be quite easy if it follows Gaussian distribution.
In the new submission, we have added the above discussion on lines 134-136, "We choose Gaussian as the prior distribution because of its analytical evaluation of the variation loss (see Eq. S9-S11 in Supplementary Note 1), and the convenience of performing sampling in the latent space." Referee #2 --Comment 3:

The elliptic structure is used as the verification and the supplementary materials provide more structures.
Please discuss how different structures affect the generality of the model or explain the applicability of the network in terms of some structures.

Authors Response 3:
Thanks for the suggestion. To verify the generality of our model, we use three quantitative indicators to evaluate the performance of our model on different structures. (1) MSE, the mean square error between the predicted spectra and ground truth spectra.
(2) Average accuracy, (1 − ℯ !"# ) × 100%, where ℯ !"# is defined as the average relative error between the predicted spectra and ground truth spectra, that is, represents the th data point of the ground truth (predicted) spectra, and is the number of spectral points.

Authors Response 5:
We thank the referee for the suggestion and have appended the relevant algorithms in the corresponding section. For the elliptical patterns, we first sample over its three parameters (i.e., major axis, minor axis, and rotation angle) with for loops, and then invoke the cv2.ellipse function to draw the geometry. Similarly, for the arc and distorted h-shaped patterns, we sample over their corresponding parameters in for loops and call matplotlib.pyplot.Arc and cv2.rectangle functions to draw the geometries. See below for three relevant and complete algorithms that are written in pseudo-code; the main built-in packages (including rotation and distortion) are underlined for clarity.
In the new submission, we have appended these pattern-generating algorithms in Supplementary Note 6, and noted them in the main text.
On lines 279-280. "The relevant pattern-generating algorithms and the detailed illustrative procedure are included in Supplementary Note 6."

Authors Response:
We thank the referee for the encouraging comments and for pointing out some novelties of our work. In the following, we will address the specific comments point-by-point whilst revising our manuscript.

Specific comments from Referee #3:
Referee #3 --Comment 1: The paper cannot be accepted due to the following major issue: There are no details given of the error between the network prediction and ground truth. The authors only provide visual comparisons. Further, there is no summary statistics given for the test set. Both of these are obligatory in the field of deep learning. Thus, there is no way to judge on the performance of their work.

Authors Response 1:
Thanks for the good question. As a indispensable complement to the visual comparisons, we define three indicators to systematically quantify the performance of our framework.
(1) MSE: the mean square error between the predicted spectra and ground truth spectra.
(2) Average accuracy (1 − ℯ !"# ) × 100%, where ℯ !"# is defined as the average relative error between the predicted spectra and ground truth spectra, that is, represents the th data point of the ground truth (predicted) spectra, and is the number of spectral points.
(3) Similarity: the correlation coefficient between two vectors/curves, defined as, where 9 ( ' : ) represents the mean of the ground truth (predicted) spectra over all points & ( & ' ). In the new submission, we have added Table R3 as Fig. 5c and above discussion, On lines 19-20, "where the inaccessible spectra are precisely forecasted without consulting structural information, reaching an accuracy of 98.77%." On lines 241-255, "To quantify the outstanding performance of our framework, … generality and versatility of R11 the proposed framework." On lines 469-470, "The quantitative comparisons of three criteria on the elliptical dataset when trained with the baseline FCN model and our framework, separately." We appreciate the referee for raising these questions. Actually, we have tried FCN at the very beginning, but found it fails to solve our question due to the existing many-to-many issue in spectral correlation. Below, we will specifically address these raised questions one-by-one.
The detailed architecture of FCN is shown in Table R4. The baseline model is composed of 9 fully connected layers with 300 hidden neurons in each layer. Correspondingly, the training (orange line) and validation (blue line) loss curves are plotted in Fig. R3. The relatively large fluctuations in the loss curves indicate the failure of convergence. In other words, even though the loss curves incline, we cannot read from which point (the convergence point) the error is within a relatively small range. Besides, the validation error stays at a high level (around 4e -3 measured in MSE), compared to 5e -4 in our proposed framework. The error of nearly one order of magnitude not only proves that the FCN fails to converge, but also exposes its performance gap with our framework when confronted with the same bidirectional non-uniqueness predicament.
As suggested, we further increase the number of layers to 10 and 11 as larger models and train them for 30,000 epochs. The complete learning loss curves are plotted in Fig. R4. The non-convergence problem is not eased with the validation error still vigorously fluctuating at a relatively high value (around 5e -3 measured in MSE).
For the last question "If you use three FCNs …", we use three independent FCN networks (each with the same configuration as the baseline model in Table R4, except for the input size and output size) to train Rxx, Rxy and Ryy, separately. Overall, we believe and have proven that the FCN model is bound to be in the cart when faced with complicated many-to-many mapping. And from the theoretical viewpoint, it is incompetent of tackling the non-convergence problem.
In the new submission, we have added Table R4 as Table S3, Fig. R3 as Fig. S3, Table R5 as Table S4, and the above discussion in Supplementary Note 2, and noted them in the main text, On lines 207-208, "see Supplementray Note 2 for more details."

Referee #3 --Comment 3:
What is the reconstruction loss on the testing set with your combined two-step networks? In your dataset, the low frequency spectra y has the direct corresponding high spectra x. So, what is the loss when you pass any y to the generator and eliminator to get the optimal output x ̂ compared with x?

Authors Response 3:
This is a very good question. Firstly, we want to explain the training process and the reconstruction loss of our R14 generation-elimination framework. As shown in Fig. R5a, such framework consists of two cascaded networks, namely, generation network and elimination network. We train these two networks independently with their own reconstruction loss, or more accurately, with their own objective loss function. For the generation network, it takes both (high spectra) and (low spectra, the label) as inputs, and finally decoded into '. The reconstruction loss of the generation network is calculated here as the distance between and ', which is only used to update the generation network. Similarly, for the elimination network, it takes both (the label) and as inputs, and finally decoded into '. The elimination network's reconstruction loss is calculated as the distance between and ', and used to update the elimination network.
When combining these two networks together, we do not need to consider additional reconstruction loss. The two networks are combined by a hierarchical bifurcating tree without further joint training. After both subnetworks have been trained independently, only their decoders are retained and will be utilized in the inference phase, as implied in Fig. R5a. For any input 3#43 , it will firstly be decoded into father nodes ( 3#43 ' ) by the generator's decoder, and then each father node will give birth to leaf nodes ( 3#43 ' ) by the eliminator's decoder. Referring to the tree diagram in Fig. R5b, the optimal output 3#43 ' is selected by finding the minimum distance between the root node (input 3#43 ) and leaf nodes ( 3#43 ' ). Besides, the value of the minimal distance would indicate the compatibility of the obtained optimal solution, that is, whether there should be a corresponding reasonable and logical output for the random input.
In the new submission, we have clarified this point, On lines 168-170, "When two cascaded networks have been trained independently, they will be combined by a hierarchical bifurcating tree without further joint training." On lines 233-234, "Besides, the value of the minimal distance will indicate the rationality of the obtained optimal solution for any random input." The optimal solution 3#43 ' is selected by calculating and finding the minimal distance between the root node 3#43 and leaf nodes 3#43 ' .

Referee #3 --Comment 4:
What is the dimension of the latent space for your other two? I assume latent space with two variables are sufficient is due to your input degree of freedom is only three (two radiuses and a rotation angle). How about larger input dimension, say the periodicity and thickness are changing? Does your latent space visualization method still hold?

Authors Response 4:
Thanks for the good question. We understand that "your other two" pointed out by the referee refers to another two geometries, the arc and distorted h-shape geometrical patterns, which are also trained with two dimensions (the same setting as the elliptical one). Here, we set the dimension as two merely because the latent space consisting of two-dimensional standard Gaussian variables can achieve relatively high accuracy, and the input degree of freedom (DOF) is not the main factor that determines the dimension. We trained our model with the dimensionality of 5 and 10, separately, on the elliptical dataset. The results can be seen in Table   R6. For both generative and eliminative networks, the ultimate validation errors are very similar in all cases, which verifies two dimensions are sufficient for the elliptical dataset. Considering the time cost and efficiency maximization, we set the dimension of our latent space as two.
As for the larger input dimension, taking the distorted h-shape as an example, its DOF is at least eight (main_width, main_height, side_width1, side_width2, side_height, angle, add_angle, distort_degree). We also trained it with two dimensions and achieved an accuracy of 98.57%.
Finally, we want to note that, in our latent space visualization task, the difficulty does not lie in the dimension of the latent space, but in how to leverage the abstract high-dimensional spectra data to colorize/feature the latent space, to further prove the sub-network can conditionally encode the input and output spectra into a compact but informative space. Furthermore, when the dimension of latent space is larger than two, we can still use a dimensionality reduction approach, such as PCA or t-SNE method (and also the auto-encoder approach we use to extract spectra representation), to reduce the dimensionality to two or three for visualization. In the inverse design work [Adv. Mater. 31, 1901111 (2019)], the latent space dimension is reduced from 20 to 2 by using the t-SNE method for visualization.
In the new submission, we have clarified this point in Supplementary Note 5, On page 11, "Even though the degree of freedom is increased (at least eight for the distorted h-shape pattern) for both cases, the dimension of the latent space is still set as two and can achieve relatively high accuracy." And added it the main text, On lines 138-139, "considering the time cost and efficiency maximization"

Table R6 | The ultimate validation errors of two sub-networks upon different dimensionality values
Referee #3 --Comment 5: It was not clear to me how you constructed the ellipse geometry in CST. In the manuscript, it seems like the geometry gets discretized to finite pixel and import that into CST? Is there a specific reason to discretize those geometries when CST can directly model continuous boundaries?

Authors Response 5:
Thanks for the careful reading. Regarding the question "It was not clear to me…import that into CST?", the answer is yes. The detailed procedure of constructing the ellipse geometry (as well as other geometries) is depicted in Fig. R6; see below.
Step 1: We firstly create the binary black/white images in Python by sampling over all possible design parameters, such as length, width, diameter and rotation angle of different parts of the geometries, utilizing built-in Python packages such as Matplotlib and OpenCV to draw these binary images. The detailed algorithms with the pseudo-code are elaborated in Referee #2 --Comment 5 on pages R8-R9.
Step 2: Then, these binary images are exported to individual .txt files as 64 × 64 matrixes (i.e., one matrix in a file), which are prepared to be taken as inputs in the next step using the MATLAB-CST co-simulation method.
Step 3: In the numerical simulation, MATLAB reads these binary matrixes from.txt files and constructs them in CST, where '1' stands for gold and '0' stands for air. Other parameters such as periodicity and thickness are also pre-defined in MATLAB (see Methods in the main text for more detail), to control the modeling in CST and then numerically calculate the spectra.  In the new submission, we have incorporated Fig. R6 as Fig. S8 and above procedure in Supplementary Note 6, and noted them in the main text, On lines 279-280, "The relevant pattern-generating algorithms and the detailed illustrative procedure are included in Supplementary Note 6." Regarding the question "Is there a …", we will explain it in the following: When CST models the geometries, they are actually constructed as discrete pixels instead of continuous boundaries, yet have a higher resolution compared with our 64 × 64 binary images. Fig. R7 shows the simulated spectra of the same pattern constructed in two different methods. The negligible error between the two spectra curves verifies the feasibility of our discretization method. Besides, the discretization method is actually widely used in intelligent metasurface designs [e.g. Adv. Mater. 31, 1901111 (2019), Nano Lett. 18, 6570-6576 (2018)]. It allows for the generation of essentially arbitrary geometries. In our scenario, taking the distorted h-shape pattern as an example, we invoke the built-in packages in Python to control the degree of distortion, which can hardly be described by a few fixed parameters. That is, the discretization method not only facilitates the data collection, but also helps enrich and diversify the dataset and thus validate the generalization of our model.
In the new submission, we have added the above discussion on lines 282-285, "Instead of using fixed parameterization, we treat metasurfaces as 2D pixel-wise images. In this way, arbitrary design shapes can be properly represented, avoiding the limitations of some existing practices that are only able to represent fixed structures with few geometrical parameters."

REVIEWER COMMENTS
Reviewer #1 (Remarks to the Author): The authors addressed my main concerns, in particular, clarifying the purpose and meaning of the manyto-many approach, and providing quantified evidence that the many-to-many approach leads to a significant accuracy improvement. However, I think the paper could still be much improved w.r.t. clarity, structure, and discussion of related work, Furthermore, results would benefit from being further extended.
Detailed comments: -The proposed ML approach seem to be related to the CycleGAN, in particular, to the concept of "cycle consistency". It seems to me that the bifurcation tree selection can be seen as a cycle consistency criterion applied at test time. It would be good to mention this related work in order to highlight that there are further ML tools that could potentially be investigated for such many-to-many mappings.
-The results section currently discusses the VAE/ELBO extensively. However, since the VAE/ELBO is an existing method, I think that such discussion would fit better in the methods section at the end of the paper. I would expect the results section to be more focused on the way the two VAEs are used together for the purpose of the application, in particular, highlighting the interconnection between the two VAEs and the bifurcation tree (cycle consistency) criterion.
-Furthermore, the clarity of the technical section could be improved. Currently, figures do not include the notation and terms used in the paper, and therefore, they do not facilitate the understanding of the technical sections of the paper. Some of the figures are unintuitive, for example, it took me time to realize in Fig. 2a (top) that the blue box is what we actually feed as input, and that the orange boxes at the input and output is some kind of recurrent connection. Maybe a more abstract subfigure showing in a very simple way how the two VAEs interconnect would be useful. Actually, I like Fig. R1 from the authors' response letter, and think it could potentially be used for the main paper. Compared with our bifurcation tree selection, the "cycle consistency" introduced in CycleGAN is designed for different technical purpose. As an upgraded version of the pix2pix model [CVPR 1125[CVPR -1134[CVPR (2017], CycleGAN eliminates the need for paired training data and works as a unsupervised model. In this regard, the incorporation of "cycle consistency" becomes essential to preserve content consistency when performing image style transfer. Besides, CycleGAN has found practicality in applications where data from each domain has inherent shared characteristics/style, such as object transfiguration and season transfer. It realizes the macroscopic and high-level feature mappings between two domains. By contrast, we adopt the idea of "cycle consistency" (i.e., Euclidean distance) to screen out the optimal candidate in our spectra correlation, achieving the precise matching of data pairs at a microscopic level of many-to-many mappings.
In the new submission, we have highlighted the CycleGAN model with its cycle consistency criterion and added the above discussion on lines 234-240, "The procedure is somehow similar to the "cycle consistency" criterion in the unsupervised model CycleGAN 45 , which also minimizes the distance between the original data and the cycle-transferred data in order to preserve content consistency while performing image style transfer. It requires the data from each domain to possess inherent shared characteristics (i.e., style). By contrast, we use "cycle consistency" for the precise matching of data pairs at a microscopic level of complex many-to-many mapping. The value of the minimal distance will also indicate the rationality of the obtained optimal solution for any random input."

Referee #1 --Comment 2:
The results section currently discusses the VAE/ELBO extensively. However, since the VAE/ELBO is an existing method, I think that such discussion would fit better in the methods section at the end of the paper. I would expect the results section to be more focused on the way the two VAEs are used together for the purpose of the application, in particular, highlighting the interconnection between the two VAEs and the bifurcation tree (cycle consistency) criterion.

Authors Response 2:
Thanks for the good suggestion. According to the referee's comments, we have simplified the description about the VAE/ELBO and included it in Methods and Supplementary Materials. Meanwhile, we also highlight the interconnection between two VAEs and the bifurcation tree in the result section; see below.
On lines 113-117, "When two cascaded networks have been trained independently, their decoders will be combined by a hierarchical bifurcating tree. As shown in Fig. 2b, the bidirectional mapping and two-way selection are built up between low-frequency and high-frequency spaces, and the inferior candidates can be excluded by comparing the Euclidean distance between the label input and secondary candidates in the lowfrequency space." On lines 226-228, "Furthermore, we exploit a tree diagram to clarify how we combine the generation "magnifier" and elimination "magnifier" to build up the bidirectional mapping between two spectra spaces in Fig. 2b and obtain the ideal solution." On lines 230-234, "The core is that the father node with the most orthodox leaf node will be chosen as the final output, and "the most orthodox" is defined as the minimum distance between the root node and the leaf node, or alternately say, the minimum Euclidean distance between the input and secondary candidates in Fig.   2b."

Referee #1 --Comment 3:
Furthermore, the clarity of the technical section could be improved. Currently, figures do not include the notation and terms used in the paper, and therefore, they do not facilitate the understanding of the technical sections of the paper. Some of the figures are unintuitive, for example, it took me time to realize in Fig. 2a (

Authors Response 3:
We thank the referee for the careful reading and the constructive suggestion. As suggested, we have included the notations and terms in Fig. 2a and Fig. 2c to facilitate the understanding. Besides, we have incorporated Fig. R1 from the last response letter as Fig. 2b in the main text, and added the corresponding description to highlight the interconnection between two VAEs; see below.
On lines 125-127, "Meanwhile, the "input" we mentioned in Fig. 1 corresponds to the "Label" (y) in both Fig.   2a and Fig. 2c. It serves as another input element as indicated by the blue boxes." On line 164, "and then the variables are randomly sampled from the selected Gaussian distribution and decoded into candidates, constituting the solution domain in Fig. 2b." On lines 444-448, "b, The interconnection between two networks. For a given input (low-frequency), the generation network can generate various candidates (high-frequency), and the elimination network will reversely map each candidate into the original space. The optimal candidate is picked out by calculating the Euclidean distance between the input and secondary candidates."

Referee #1 --Comment 4:
While the authors have added a quantified comparison, I think the results section remains a bit thin, in particular, it would be good to have at least two distinct data generation / numerical simulation scenarios in order to demonstrate that the high reported performance is reproducible under slightly varying conditions.

Authors Response 4:
Thanks for the good comment. To verify the effectiveness of our method, we further demonstrate two distinct data generation/scenarios. The first case is that we use different metasurfaces geometries. Figure R2 demonstrates the results of arc and distorted h-shape metasurface patterns, respectively. The first plot in each row is the input spectra from the test dataset, and the second plot is the diverse candidates generated by the generation network. The last plot in each row is the final solution singled out by the elimination network, where the solid line and inset are the ground-truth optical response and design pattern, respectively. The close match between the dashed line and solid line in all cases proves the generality of our model. The second case is the global far-field customization of intelligent metasurfaces, for the application of fifthgeneration (5G) wireless communication. Intelligent metasurfaces have been found to be a superior candidate for managing wireless channels in a green and cost-effective manner [Sci. Adv. 8, eabn7905 (2022)]. As shown in Fig. R3a, the goal is to steer the main lobe towards the user's direction. For a given direction, our generationelimination framework will firstly generate various nominated metasurface distributions, and then filter out the optimal one with the criterion of minimal distance regarding the main lobe. As shown in Fig. R3b, the blue curve in each test instance is the input/desired main lobe, and the orange curve is the simulation result of the optimal metasurface design filtered by the elimination network. In both cases, the close match between the orange curve and the blue one proves the versatility of our framework. All I concerned in my last comments have been addressed in this revised manuscript, which is quite satisfying.
I recommend acceptance of the revised version.

Authors Response:
We thank the referee for the positive comments and the recommendation of our work.